docs(simd): TD-SIMD-8 — F16 honesty + matrix audit for missing lanes by AdaWorldAPI · Pull Request #178 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-20T14:46:18Z

Summary

TD-SIMD-8 (F16 honesty) + matrix audit for missing lane wrappers (U16/U32/U64, I4/8/16/32/64, F32 — user request).

F16 honesty

src/simd_half.rs::F16x16 — docstring now explicitly discloses scalar [u16; 16] storage and routes hot loops to core::simd::f16x16 (under nightly-simd) or to fp32 with conversion at boundaries.
Disambiguates from simd_avx2::F16Scaler — that's a scaling context for range-normalizing values before f16 encoding, NOT the F16x16 SIMD type. Both files now cross-reference each other.
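The "fp32 with conversion at boundaries" route can be pictured with a minimal sketch. It assumes the lanes are raw IEEE 754 binary16 bit patterns held in a `[u16; 16]`, as the docstring describes; `f16_bits_to_f32` and `sum_f16_lanes` are hypothetical names for illustration, not functions from `src/simd_half.rs`.

```rust
/// Convert one IEEE 754 binary16 bit pattern to f32 (scalar, no intrinsics).
/// Hypothetical helper for illustration only.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1F) as u32;
    let frac = (h & 0x03FF) as u32;
    let bits = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal: value is frac * 2^-24, which is exactly representable in f32.
        (0, _) => sign | (frac as f32 * (1.0 / 16_777_216.0)).to_bits(),
        // Infinity (frac == 0) or NaN (frac != 0).
        (0x1F, _) => sign | 0x7F80_0000 | (frac << 13),
        // Normal: rebias the exponent from 15 to 127 and widen the mantissa.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}

/// Widen 16 half-precision lanes once at the boundary, then do the
/// arithmetic in f32 so the hot loop never touches f16 storage.
pub fn sum_f16_lanes(lanes: &[u16; 16]) -> f32 {
    let mut wide = [0.0f32; 16];
    for (w, &h) in wide.iter_mut().zip(lanes.iter()) {
        *w = f16_bits_to_f32(h);
    }
    wide.iter().sum()
}
```

The point of the pattern is that the conversion cost is paid once per load or store rather than once per arithmetic operation.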

Matrix corrections

Cross-referenced every pub struct *x* in simd_avx512.rs, simd_avx2.rs, simd_neon.rs, simd_nightly/mod.rs against the parity matrix. Found these gaps:

| Change | Reason |
|---|---|
| `F32x8` v3: ❌ → ✅ `__m256` (in `simd_avx512`) | `src/simd.rs:294` already imports it on the v3 path; it's AVX (not AVX-512), so it works on Sandy Bridge+ |
| `F64x4` v3: ❌ → ✅ `__m256d` (in `simd_avx512`) | Same as F32x8 |
| `U32x8` row added | nightly-only; missing on x86 / aarch64 / scalar |
| `U64x4` row added | nightly-only |
| `U16x16` row added | missing on EVERY backend (incl. nightly) |
| `I32x8` row added | missing on EVERY backend (incl. nightly) |
| `I64x4` row added | missing on EVERY backend (incl. nightly) |
| `F32Mask8` row added | declared as `F32Mask8Scalar` in `simd_scalar`; not surfaced through `crate::simd::*` |
| `F64Mask4` row added | declared as `F64Mask4Scalar` in `simd_scalar`; not surfaced |

Sub-byte lanes section added

I4 / U4 (4-bit nibbles) are used by INT4 quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper exists anywhere: consumers pack two nibbles per byte and operate through U8x64, using a 16-bit right shift (shr_epi16, since x86 has no per-byte SIMD shift) followed by & 0x0F masks. The section documents the hardware story (AVX-512 VBMI2 VPCOMPRESSB and AVX-512 IFMA VPMADD52 on x86; shr+mask on aarch64). Tracked as TD-SIMD-11 if a consumer files for it.
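The pack-two-nibbles-per-byte scheme can be sketched in scalar Rust. This is an illustration only: the low-nibble-first ordering is an assumption (real formats such as Q4_0 define their own layouts), and `pack_u4`, `unpack_u4`, and `sign_extend_i4` are hypothetical names, not crate APIs.

```rust
/// Pack 4-bit values two per byte, low nibble first (assumed ordering).
/// An odd trailing value is paired with a zero high nibble.
pub fn pack_u4(nibbles: &[u8]) -> Vec<u8> {
    nibbles
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0F;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack with the shift + mask idiom: `& 0x0F` for the low nibble,
/// `>> 4` for the high nibble.
pub fn unpack_u4(packed: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &b in packed {
        out.push(b & 0x0F);
        out.push(b >> 4);
    }
    out
}

/// Reinterpret an unpacked nibble as a signed I4 value in -8..=7 by
/// moving it to the top of the byte and arithmetic-shifting back down.
pub fn sign_extend_i4(nibble: u8) -> i8 {
    ((nibble << 4) as i8) >> 4
}
```

A vectorized version would apply the same mask and shift across all 64 bytes of a U8x64 at once; the scalar form just makes the bit layout explicit.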

TD-SIMD-8 row updated

§5 entry now points at src/simd_half.rs:123 (the actual F16x16 polyfill) rather than the unrelated F16Scaler at simd_avx2.rs:2566. Documents the three remediation options: (a) wire _mm256_cvtph_ps under target_feature = "f16c" (Ivy Bridge+; all AVX-512 hosts), (b) F16x16Scalar alias to make scalar nature explicit at consumer call sites, (c) type-level doc-warning. ~80 LoC estimate.
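Option (a) might look roughly like the sketch below. It is not the crate's code: `widen16` and `widen16_f16c` are hypothetical names, and the sketch uses runtime detection via `is_x86_feature_detected!` instead of the compile-time `target_feature = "f16c"` gate the entry mentions, so that it builds on any host and returns `None` where F16C is unavailable.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm256_cvtph_ps, _mm256_storeu_ps, _mm_loadu_si128};

/// Widen 16 half-precision bit patterns to f32 with two F16C conversions
/// (8 lanes each). Callers must ensure AVX and F16C are available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn widen16_f16c(src: &[u16; 16]) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    unsafe {
        // Unaligned 128-bit loads: lanes 0..8 and 8..16.
        let lo = _mm_loadu_si128(src.as_ptr() as *const __m128i);
        let hi = _mm_loadu_si128(src.as_ptr().add(8) as *const __m128i);
        _mm256_storeu_ps(out.as_mut_ptr(), _mm256_cvtph_ps(lo));
        _mm256_storeu_ps(out.as_mut_ptr().add(8), _mm256_cvtph_ps(hi));
    }
    out
}

/// Returns `Some` when the hardware path is usable, `None` otherwise
/// (non-x86_64 targets, or x86_64 hosts without AVX + F16C).
pub fn widen16(src: &[u16; 16]) -> Option<[f32; 16]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c") {
            // SAFETY: the required CPU features were just detected.
            return Some(unsafe { widen16_f16c(src) });
        }
    }
    let _ = src;
    None
}
```

A real wiring would presumably fall back to the scalar polyfill rather than return `None`; the `Option` here only keeps the sketch short.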

Test plan

Docs-only — cargo check paths unchanged.
cargo fmt --check clean (no Rust code changed beyond two doc comments).
CI green — no behavior change.

Generated by Claude Code

# F16 honesty (TD-SIMD-8)

`src/simd_half.rs` F16x16: docstring now explicitly discloses scalar storage and routes hot loops to `core::simd::f16x16` (under `nightly-simd`) or to fp32 with conversion at boundaries. Disambiguates from `simd_avx2::F16Scaler`, a scaling CONTEXT for range-normalizing values before f16 encoding, not the F16x16 SIMD type. Both files cross-reference each other so a future reader doesn't repeat the confusion.

`src/simd_avx2.rs` F16Scaler: docstring strengthened with the same disambiguation note.

# Matrix audit (user request)

Cross-referenced every `pub struct *x*` in simd_avx512.rs, simd_avx2.rs, simd_neon.rs, simd_nightly/mod.rs against the parity matrix in the architecture doc. Corrections:

- **F32x8 / F64x4 v3 column: ❌ → ✅ `__m256`/`__m256d` (in `simd_avx512`)**. The dispatch at `src/simd.rs:294` already imports these from simd_avx512 on the v3 / AVX2 path. They're AVX (not AVX-512), so they work on every Sandy Bridge+ host. The matrix was stale.
- **U32x8, U64x4 rows added**: nightly-only currently; ❌ on x86 + aarch64 + scalar. core::simd has them via `simd_nightly`.
- **U16x16, I32x8, I64x4 rows added**: missing across EVERY backend including nightly. Theoretical 256-bit shapes no consumer has reached for yet.
- **F32Mask8 / F64Mask4 rows added**: declared in simd_scalar as `F32Mask8Scalar` / `F64Mask4Scalar` (rename came from a duplicate-decl conflict on i686); not surfaced through `crate::simd::*`. AVX-512 has them natively via `__mmask8` but they're not typed.
- **Sub-byte lanes section added**: I4 / U4 lanes used by INT4 quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper; consumers pack 2× nibbles per byte and operate through U8x64 + shr/mask. Documents the hardware story (AVX-512 VBMI2, VPCOMPRESSB on x86; shr+mask trick on aarch64). Tracked as TD-SIMD-11 if a consumer files for it.

TD-SIMD-8 description updated in §5 to point at `simd_half.rs:123` (the actual F16x16 polyfill) rather than `simd_avx2.rs:2566` (the unrelated F16Scaler scaling utility).

…ss all backends

PR #178's matrix audit surfaced five 256-bit int lane types that were either entirely missing or stranded in `simd_nightly` only. Adds them across every backend so `crate::simd::{U16x16, U32x8, U64x4, I32x8, I64x4}` resolves uniformly on v3 / v4 / native / nightly / scalar / aarch64 paths.

`src/simd_avx2.rs`
+ 5× `avx2_int_type!` instantiations producing scalar-storage `[$elem; $lanes]` polyfills (align 64). Same macro pattern as the existing 512-bit polyfills (U8x64, U16x32, …). Native AVX2 `__m256i` upgrades are TD-SIMD-3.
+ 5× lowercase aliases (`u16x16 = U16x16`, etc.) matching the std::simd convention used by every other lane type in the file.

`src/simd_scalar.rs`
+ 5× `impl_int_type!` instantiations mirroring the AVX2 polyfills above. Consumers on non-x86/non-aarch64 (wasm32, riscv, thumb) reach the same type names through `crate::simd::*`.
+ Lowercase aliases.

`src/simd_avx512.rs`
+ Re-export of the new types from `simd_avx2` so the v4 dispatch arm in `simd.rs` can surface them without forking the macro into this file. Both files are already gated on `target_arch = "x86_64"`, so the re-export is cheap. Native `__m256i` upgrades here are TD-SIMD-3 (same story as the v3 polyfills).

`src/simd_nightly/u_word_types.rs`
+ `U16x16` wrapper backed by `core::simd::u16x16`. Same API surface as the existing 32-/16-/8-lane wrappers: splat, from_slice, from_array, to_array, copy_to_slice, reduce_{sum,min,max}, simd_min/max, cmpeq_mask, cmpgt_mask, Default.

`src/simd_nightly/i_word_types.rs`
+ `I32x8` and `I64x4` wrappers backed by `core::simd::{i32x8, i64x4}`. Same API surface as siblings; PartialEq via array compare.

`src/simd_nightly/mod.rs`
+ Re-exports for the three new types + lowercase aliases.

`src/simd.rs`
+ All 5 dispatch arms (nightly, v4, v3, aarch64, scalar fallback) updated to surface the new types through `crate::simd::*`.

`.claude/knowledge/simd-dispatch-architecture.md`
+ Parity matrix updated: the five rows previously marked ❌ across most backends now show 🟠 polyfill (v3, v4-via-v3, scalar) / 🔵 (nightly via `core::simd`).

Verified: `cargo check` clean under default v3 features and under `-Ctarget-cpu=x86-64-v4` (via `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUSTFLAGS` + explicit `--target` so build scripts don't SIGILL on non-AVX-512 runners; same pattern as the tier4-avx512-check job).
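To make the "scalar-storage `[$elem; $lanes]` polyfill (align 64)" idea concrete, here is a minimal sketch of what such a macro could generate. `int_lane_polyfill!` is a hypothetical stand-in, not the crate's `avx2_int_type!` or `impl_int_type!`; the method set is a small subset of the API surface listed above, and the wrapping behaviour of `reduce_sum` is an assumption.

```rust
// Hypothetical macro: generates a 64-byte-aligned lane type backed by a
// plain array, plus a lowercase alias in the std::simd naming style.
macro_rules! int_lane_polyfill {
    ($name:ident, $alias:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug, PartialEq)]
        #[repr(C, align(64))]
        pub struct $name(pub [$elem; $lanes]);

        impl $name {
            pub const LANES: usize = $lanes;

            /// Broadcast one value to every lane.
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }

            /// Load the first LANES elements; panics if the slice is shorter.
            pub fn from_slice(s: &[$elem]) -> Self {
                let mut a = [0 as $elem; $lanes];
                a.copy_from_slice(&s[..$lanes]);
                Self(a)
            }

            pub fn to_array(self) -> [$elem; $lanes] {
                self.0
            }

            /// Horizontal sum; wrapping on overflow (assumed semantics).
            pub fn reduce_sum(self) -> $elem {
                self.0.iter().fold(0 as $elem, |acc, &x| acc.wrapping_add(x))
            }

            /// Lane-wise minimum.
            pub fn simd_min(self, other: Self) -> Self {
                let mut out = self.0;
                for (o, &b) in out.iter_mut().zip(other.0.iter()) {
                    *o = (*o).min(b);
                }
                Self(out)
            }
        }

        #[allow(non_camel_case_types)]
        pub type $alias = $name;
    };
}

int_lane_polyfill!(U16x16, u16x16, u16, 16);
int_lane_polyfill!(U32x8, u32x8, u32, 8);
int_lane_polyfill!(U64x4, u64x4, u64, 4);
int_lane_polyfill!(I32x8, i32x8, i32, 8);
int_lane_polyfill!(I64x4, i64x4, i64, 4);
```

Because every backend exposes the same type names, a later native `__m256i` implementation could replace the array storage without touching call sites.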

AdaWorldAPI merged commit 2f096d3 into master May 20, 2026
16 checks passed

AdaWorldAPI mentioned this pull request May 20, 2026

feat(simd): missing-lanes sweep — U16x16/U32x8/U64x4/I32x8/I64x4 #179

Merged

AdaWorldAPI merged 1 commit into master from claude/pr-x-td8-f16-honesty
